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Summary 

Genetic reassortment plays a vital role in the evolution of the 
influenza virus and has historically been linked with the 
emergence of pandemic strains. Reassortment is believed to 
occur when a single host - typically swine - is simultaneously 
infected with multiple influenza strains. The reassorted viral 
strains with novel gene combinations tend to easily evade the 
immune system in other host species, satisfying the basic 
requirements of a virus with pandemic potential. Therefore, it is 
vital to continuously monitor the genetic content of circulating 
influenza strains and keep an eye out for new reassortants. We 
present a new approach to identify reassortants from large data 
sets of influenza whole genome nucleotide sequences and report 
the results of the first ever comprehensive search for 
reassortants of all published influenza A genomic data. 35 of 
the 52 well supported candidate reassortants we found are 
reported here for the first time while our analysis method offers 



new insight that enables us to draw a more detailed picture of 
the origin of some of the previously reported reassortants. A 
disproportionately high number (13/52) of the candidate 
reassortants found were the result of the introduction of novel 
hemagglutinin and/or neuraminidase genes into a previously 
circulating virus. The method described in this paper may 
contribute towards automating the task of routinely searching 
for reassortants among newly sequenced strains. 

© 2012. Published by The Company of Biologists Ltd. This is 
an Open Access article distributed under the terms of the 
Creative Commons Attribution Non-Commercial Share Alike 
License (http://creativecommons.0rg/licenses/by-nc-sa/3.O). 
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Introduction 

The influenza A virus has caused substantial morbidity and 
mortality in humans as well as livestock. Many severe influenza 
epidemics have been documented in the past 300 years of world 
history, while four pandemics and numerous other epidemics 
have occurred in the past century (Kilbourne, 2006; Potter, 
2001; Zimmer and Burke, 2009). Furthermore, each year 
seasonal influenza results in a considerable death toll 
worldwide. Spanish Flu, the first pandemic of the past 
century, killed around 50 million people and earned the 
honorary title "Mother of all pandemics" (Taubenberger and 
Morens, 2006). The evolutionary origin of the virus responsible 
for this calamity has always been subject to controversy and is 
still not fully resolved (Antonovics et al., 2006; Reid and 
Taubenberger, 2003; Taubenberger, 2006). 

It has been long established that genetic reassortment of the 
eight RNA segments that constitute the influenza A virus may 
produce novel viruses in nature (Desselberger et al., 1978). This 
happens when two or more viruses coinfect the same cell and 
plays a very important role in the long term evolution of the virus 
as well as in the making of global influenza pandemics. In 
particular, reassortment events in pigs may give birth to novel 
viruses with pandemic potential in humans (Castrucci et al., 
1993; Ma et al., 2008; Ma et al, 2009). In fact, all known 
pandemics in recent human history have been attributed as results 
of reassortment of genes between two or more distinct viruses, 



with concrete evidence in support of this fact for all except the 
1918 Spanish Flu pandemic (Kawaoka et al., 1989; Lindstrom et 
al., 2004; Scholtissek et al, 1978; Smith et al., 2009a). This was 
only possible due to the availability of genetic information of 
contemporary and precursor viral isolates. Furthermore, there is 
independent multiple evidence that the pandemic H1N1/2009 
virus has reassorted again in swine, possibly giving it a chance to 
escape herd immunity in humans (Ducatez et al., 2011; 
Vijaykrishna et al., 2010). Isolated human infections of such 
viruses have also been reported recently (Centers for Disease 
Control and Prevention (CDC), 2011). 

There is little argument about the necessity of continuously 
monitoring circulating influenza viruses for the possible 
emergence of new reassortant influenza A viruses. With the 
number of full genome influenza sequences in publicly available 
databases growing exponentially in recent years (http://www. 
ncbi.nlm.nih.gov/genomes/FLU/growth.html), there is little con- 
cern about the availability of data as well. Furthermore, with 
astronomic advances in sequencing technology within the last 
few years (Schuster, 2008), it will not be long before it becomes 
not only possible, but also the most reasonable thing to do, to 
sequence each individual viral genome we would encounter or 
have access to. Computing infrastructure development would be 
the need of the hour when such a situation unfolds, as a lot of data 
would be of no use without ways and means to interpret them 
efficiently and accurately. 
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However, very few attempts have been made to automate the 
identification of reassortment in the influenza virus. A recent 
paper (Nagarajan and Kingsford, 2011) described an interesting 
method to identify reassortments which employs a graph mining 
technique to find topological incongruities between collections of 
Markov chain Monte Carlo-sampled trees from different 
segments. It seemed to be technically sound and may, by itself, 
even prove to be sufficiently robust in handling very large data 
sets. However, the computational cost of phylogeny 
reconstruction (MCMC-sampling in this case) is a formidable 
obstacle towards using phylogeny dependent methods for 
identifying reassortments from very large data sets. We aim to 
overcome this problem by formulating a phylogeny independent 
method which uses only the nucleotide distance matrices as input. 

Virologists would usually check for reassortment in their full 
genome sequences by doing a homology search on a public 
influenza sequence database or by reconstructing the phylogenies 
with reference sequences for each of the segments (Holmes et al., 
2005; Karasin et al., 2000; Lindstrom et al., 2004). If the most 
homologous strains do not match across all segments, the 
possibility of the query sequence belonging to a reassorted virus 
is considered high. Our algorithm is based on the same principle, 
but we focus on the 'neighbours' on the phylogenetic tree of each 
segment rather than doing a homology search. The 
neighbourhood is defined as a fixed number of closest 
neighbours of a given strain on a given segment phylogeny. 
The number of common elements in the neighbourhoods of the 
same taxa on two different segments denotes whether the two 
segments originated from a single parent or not. A very low 
common neighborhood size is very often a sign of reassortment. 

Materials and Methods 

Definition of reassortment 

While genetic reassortment is a subject of wide interest in influenza research, it has 
not been clearly defined so far to our knowledge. We do not plan to give a 
rigourous mathematical definition here, but a solid image of what amounts to 
reassortment is imperative before setting out to detect them. Any influenza virus 
that has at least one pair of segments (out of all possible combinations of two out 
of eight) such that each segment is clearly derived from one of two distinct parents 
is what we call a reassortant - the direct product of a reassortment event - in this 
paper. It must be noted that, for all practical puiposes, a direct progeny of such a 
reassortant may not be distinguishable from the original strain under this 
definition, particularly if the original strain has not been sampled. Thus, it is 
practically impossible to fully ascertain when or where the reassortment event took 
place, unless there is sufficient clinical data to establish the viral transmission 
pathways surrounding the event. Here, we base our results solely on nucleotide 
sequence data, and refrain from making any distinction between strains with near 
identical sequences or reassortment patterns. 

Overview of the algorithm 

For any given flu genome sequence, the only straightforward practical test for 
reassortment is to find for each segment the 'closest 1 strains out of a reference set 
and compare the results across segments. For each segment, the set of r closest 
strains (including self) is called the r -neighbourhood for that segment (Fig. 1). 
When the /*-neighbourhoods of two segments are compared, the common elements 
in the two sets form the common neighborhood and its number of elements is 
referred to as the common neighborhood size. If this size is one, the two segments 
are deemed to have different ancestry; if it is more than one but comparatively 
smaller than r, there is still a high chance of different ancestry. We test all 
combinations of segment pairs and determine if a given strain is a reassortant or 
not. Highest sensitivity would be achieved by our method when all known 
complete genomic sequences are included in the reference set. We shall make a 
data set of all available complete genome sequences, and sequentially test each 
genome sequence using the rest of the set as the reference. 

Preparation of nucleotide sequence data 

All available full genome influenza A nucleotide sequences (as at 24th June 201 1) 
were downloaded from the FTP servers of Influenza Virus Resource at NCBI 




Fig. 1. Neighbourhoods of strain t in segments i,j and k. t falls on different 
lineages in segments i and j and have no common elements in their 
neighbourhoods except for strain t itself. Conversely, t has identical roots in 
segments i and k, giving rise to identical /--neighbourhoods. 



(http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html). Duplicate sequences 
were removed within each segment, and incomplete genomes were further removed 
to have 9284 sequences for each segment (see supplementary material Table S 1 for a 
complete list). The coding regions for Ml, M2, NS1 and NS2 proteins were treated 
as four individual segments (segment 7-10 respectively) for ease of enumeration. 
Both coding regions were used for these two segments for added confidence. The 
coding region of PB1-F2 was left out due to its relatively short length and lack of 
annotation or presence in some lineages. For each of the 10 segments, multiple 
alignment was performed by MAFFT and the results were further enhanced by 
manual editing. Phylogenetic analysis of each major coding region of the eight RNA 
segments (M2 and NS2 are not included) was performed as a separate study (de Silva 
etal., 2012). 

In order to minimize sampling bias, highly similar genome sequences (>97% in 
HA and NA; >98% in all other segments simultaneously) were removed, while 
retaining one genome to represent each class of such genome sequences. This 
resulted in a final dataset of 1670 representative sequences, on which all further 
analysis was performed. 

Algorithm 

The algorithm is described in detail along with definitions of the concepts 
involved. It is implemented by a Ruby script using Bioruby (Goto et al., 2010), 
which is available from the authors upon request. All calculations were performed 
on a Debian Linux server powered by 2 quad-core Intel Xeon X7560 / 2.27 GHz 
Processors with 264 GB RAM. The run time was approximately 16 hrs for the 
current data. 

Step 0. Calculation of genetic distances 

The genetic distances between all pairs of nucleotide sequences are calculated by 
Phylip dnadist version 3.67 using Jukes Cantor method and the respective distance 
matrices for each of the segments are consequently used in the analysis as 
described below. 

Step 1. Determination of the common neighbourhoods 

For strain I, let its r- neighbourhood in segment / — (1,..,10) be defined as 

4'}. — {sEQ\d('\t,s) <^''(/,^v,/-)} where Q. denotes the complete set of viral 
strains and d^'hs the distance measure on segment /', and 

(t,s tl i) <,d® (tjtz) ^ ■■• <,d®(t,s t jt) where n=|Q| and all the s t ,* are 
distinct. In other words, it is the set of r nearest strains to strain t with respect 
to the genetic distance calculated using the nucleotide sequence data from the V th 
segment. 

Moreover, for strain t, let M, — ffflQ be its common neighborhood size matrix 
defined by 



\ne^\fori,j= 



.10. 
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M, is calculated for all strains, which in turn is examined for signs of 
reassortment. wL is the ij- common neighbourhood size of strain t, which is the 
number of common elements in its r - neighbourhoods in segments i and /. 

Parameter tweaking 

The smaller the values of the m\j{\ <ij< 10), the stronger the argument for being 
a reassortant, but where to draw the line is not trivial. We experimented with 
different numbers and arrived at two separate search criteria, biased more towards 
sensitivity than specificity. The neighbourhood size r is 80 (approximately 5% of 
the number of taxa) in what follows. 

Step 2. Reassortment search 

Let mititi - 



mm f , 1 

<j<w\ m v\- 



max . , 
and max,, = 1 <j < 10 } tn< 1 , 
l*i 1 ' 

For each strain t, each row of its common neighbourhood matrix is inspected to 
see if it satisfies one or more of the following conditions. 
If min„<G.\r, and 

1) if max„<0.lr, t is a product of single segment reassortment.(S) 

2) if max„>min„ + 0.8r, t is a product of multiple segment reassortment. (M) 

Here, "single segment reassortment" refers to the event where one segment is 
uniquely derived from a parent virus different from those of the other segments, 
while "multiple segment reassortment" refers to the event where more than one 
segment are derived from a common parent different from those of the other 
segments. A typical reassortment event may be a mixture of one or several 
reassortments of these two types. 

Results and Discussion 

Our common neighbourhood approach easily picked up 
reassortants whose parent sequences were sufficiently distant 
and where the reassortment had not become fixed in the 
population. The common neighbourhood matrix of A/Swine/ 
Italy/I521/98(H1N2) is a good example (Table 1). In segment 
4(HA), it has only two elements in common between its segment 
4 neighbourhood and most of the other neighbourhoods, which is 
a good example of a clean reassortment event. Segment 6(NA) is 
very similar in its relationship with the other segments, with their 
biggest common neighbourhoods having eleven elements. The 
rest of the segments have a fairly high number of common 
'neighbours' amongst themselves. Altogether, this virus has three 
parents: two of them contributing one segment each and the last 
contributing the remaining segments. 

A comprehensive search of all available full genome sequence 
data (9284 strains represented by 1670 genome sequences) 
resulted in 280 hits (see supplementary material Table S2 for a 
complete list). These represented a total of 3086 influenza A 
strains in the original set, with the pandemic H1N1/2009 virus 



responsible for 2636 of them. Due to space constraints, we chose 
to show just 52 of them with the highest confidence by limiting to 
hits with 12 or more very small elements (size ^2) in their 
respective common neighbourhood size matrices (Table 2). 
Some of the reassortants that we found have already been 
reported, in which case we have shown the reference or 
commented about the annotation. In some other cases, the 
sequences have been published in a journal, but the reassortment 
has not been explicitly declared. Altogether, 35 of the total of 52 
reassortants are reported here for the first time, to the best of our 
knowledge. 

A/California/04/09, the reference strain for the pandemic 
H1N1/2009 virus, was easily picked up by the algorithm 
notwithstanding the huge sampling bias, while its reassortment 
pattern (Smith et al., 2009a) was subsequently determined 
correctly (Table 3). 

Of particular interest was an individual segment's propensity 
to reassort and acquire genetic information from a parent unique 
to itself or at most common to one more segment. The 7- 
l(eg:aaabaaaa) and 6-2(eg:aaababaa) reassortment patterns are 
the most typical of this kind. Segment 4 and 6, which code the 
HA and NA genes, tend to reassort in this way very often (13/52 
instances). 

Using our algorithm, we were able to identify further 
breakdowns in the ancestry of known reassortants. In A/Swine/ 
Ontario/53518/03 (Karasin et al., 2006), for example, we found 
that PB2 - as well as the previously reported PB1 - was derived 
from a unique parent of its own. In the 2005 triple reassortant 
H3N2 viruses from Canada (Olsen et al., 2006), we found that the 
PB1 gene was of a lineage distinct from that of the other 
polymerase genes and close to that of HA. Moreover, as a by- 
product of our analysis, we found that these swine viral 
sequences were very similar to A/turkey/Ontario/31232/ 
2005(H3N2), a contemporary avian virus from the same region, 
strongly suggesting cross species transmission. This finding was 
only possible owing to our comprehensive dataset spanning all 
hosts. 

Similarly, it is quite trivial to find influenza sequences 
"frozen" in time. A/USSR/90/1977(HlNl), one of the first 
H1N1 isolates after its re-emergence in 1977 (Zimmer and Burke, 
2009) after a 20 year lapse, happened to possess a genomic 
sequence very similar to that of A/Roma/ 1949(H1N1). 

Furthermore, it proved to be powerful enough to analyse 
complex reassortment patterns within closely related sequences, 
when an appropriate data set is used. For example, the predicted 



Table 1. Common neighbourhood size matrix of A/Swine/Italy/I521/98(H1N2). 



PB2 



PB1 



PA 



HA 



NP 



NA 



Ml 



M2 



NS1 



NS2 



sum 



PB2 

PB1 

PA 

HA 

NP 

NA 

Ml 

M2 

NS1 

NS2 




80 



51 10 
10 
10 



i so 



52 
55 
53 



8 


52 


11 


5 


46 


11 


2 


45 


6 


2 


38 





52 
11 



58 
43 
41 



46 
49 
49 



46 



11 

58 
80 
38 
42 



42 
50 
40 



45 



6 

43 
38 
80 
43 



40 
41 
40 



38 



6 
41 

42 
43 
80 



56 
61 
57 



11 

58 
58 
50 
43 



354 
381 
4 



373 
344 
309 
293 



The first ten columns make up the common neighbourhood matrix, where each entry gives the number of common elements in the two neighbourhoods of the 
given strain from the respective segments. The last three columns give the minimum, maximum, and sum of each row excluding the diagonal entry. 
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Table 2. List of predicted reassortant strains with strong confidence. All strains with a reference have been reported previously 
unless otherwise noted. Strains that do not have a reference or a specific remark have not been reported to date, to the best of 

our knowledge. 



Year 



Strain 



Subtype #nbds Type Ref 



Comments 



Avian 



1966 


A/ equine/Lexington/ 1/1966 


H7N7 


12 


M 


1966 


A/turkey/Ontario/62 1 3/1966 


mixed 


14 


M 


1968 


A/turkey/Wisconsin/1/1968 


H5N9 


12 


S 


1976 


A/pintail duck/ALB/86/1976 


H3N2 


18 


M 


1976 


A/mallard duck/ALB/57/1976 


H5N2 


16 


M 


1979 


A/pintail duck/ALB/628/1979 


H6N8 


18 


S 


1982 


A/blue-winged teal/ALB/685/1982 


H6N4 


16 


S 


1987 


A/turkey/MO/2 1939/1987 


H1N1 


18 


s 


1988 


A/mallard duck/ALB/321/1988 


H9N2 


12 


s 


1992 


A/duck/Nanchang/ 1 68 1 / 1 992 


H3N8 


14 


M 


1994 


A/duck/NY/1 3 1 52- 1 3/ 1 994 


H1N1 


16 


s 


1996 


A/Goose/Guangdong/1/96 


H5N1 


_ 


M 


1999 


A/duck/Shimane/188/1999 


H1N1 


34 


S 


2000 


A/mallard/Netherlands/02/2000 


H10N7 


18 


s 


2001 


A/duck/Hokkaido/ 1 20/200 1 


H6N2 


18 


M 


2001 


A/chicken/Kobe/aq26/200 1 


H9N2 


16 


M 


2002 


A/mallard/Alberta/149/2002 


H2N4 


14 


S 


2002 


A/environment/Hong Kong/674. 15/2002 


H5N1 


16 


s 


2004 


A/chicken/Henan/0 1 /2004 


H5N1 


14 


s 


2005 


A/domestic green- winged teal/Hunan/79/2005 


H5N1 


18 


M 


2005 


A/teal/Italy/393 1-38/2005 


H5N2 


16 


s 


2005 


A/common murre/Oregon/1 9497-004/2005 


H9N5 


16 


M 


ZUUj 


A/whooper swan/Mongolia/232/2005 


riiziN d 


zu 


M 


2006 


A/mallard/Netherlands/30/2006 


H1N4 


12 


M 


2006 


A/pekin duck/California/P30/2006 


H4N2 


22 


S 


2006 


A/mallard/Pennsylvania/454069- 1 2/2006 


H5N4 


14 


M 


2006 


A/northern shoveler/Netherlands/1/2006 


H8N4 


12 


M 


2006 


A/chicken/Pakistan/UDL-01/2006 


H9N2 


16 


M 


2006 


A/sanderling/Delaware/449/2006 


H9N2 


16 


S 


2006 


A/shorebird/Delaware/249/2006 


H9N2 


12 


S 


2007 


A/chicken/Laos/PO 1 30/2007 


H5N1 


16 


M 


2007 


A/environment/Hunan/5-32/2007 


H5N1 


26 


M 


2007 


A/little egret/Hong Kong/8550/2007 


H5N1 


18 


M 


2007 


A/garganey/Altai/ 1 2 1 3/2007 


H5N2 


18 


S 


2007 


A/shorebird/Delaware/554/2007 


H9N1 


14 


S 


2007 


A/chicken/Hubei/C 1/2007 


H9N2 


14 


M 


2007 


A/mallard/Maryland/798/2007 


mixed 


22 


S 


2007 


A/burasian wigeon/Netherlands/4/2007 


H10N1 


16 


M 


2007 


A/mallard/Netherlands/ 1 7/2007 


H11N8 


14 


M 


2008 


A/environment/Hunan/6-69/2008 


H5N1 


24 


M 


2008 


A/peregrine falcon/Hong Kong/2142/2008 


H5N1 


26 


M 


200o 


A/nortnem snovelei/mteiior AlasKa/8BM.34 /0/Z008 


H9N2 


1 2 




2009 


A/mallard/Hokkaido/24/2009 


H5N1 


32 


M 


2009 


A/goose/Czech Republic/ 1 848-K9/2009 


H7N9 


12 


S 


Swine 










1998 


A/swine/Italy/1521/98 


H1N2 


12 


S 


2002 


A/swine/Guangdong/ 102/2002 


H3N2 


20 


M 


2004 


A/swine/Guangxi/wz/2004 


H5N1 


14 


M 


2004 


A/swine/Guangdong/wxl/2004 


H9N2 


22 


M 


2005 


A/swine/Shandong/3/2005 


H3N2 


20 


M 


2008 


A/swine/Shandong/ 1 123/2008 


H1N1 


16 


M 


Human 










1976 


A/New Jersey/1976 


H1N1 


12 


M 


Others 










2010 


A/canine/Korea/1/2010 


H3N1 


14 


M 



(Obenauer et al., 2006) 
(Obenauer et al., 2006) 
(Obenauer et al., 2006) 
(Obenauer et al., 2006) 



Genotype:H3,5,6Nl 

Reassortment not mentioned 
Reassortment not mentioned 
Reassortment not mentioned 
Reassortment not mentioned 



(Hatchette et al., 2004; Suggests reassortment 
Obenauer et al., 2006) 

(Obenauer et al., 2006) Reassortment not mentioned 

(Li et al., 2004; Mukhtar Progenitor of HPAI 

et al., 2007; Xu et al., 

1999) 



(Mase et al., 2007) Reassortment not mentioned 

(Hatchette et al., 2004) Reassortment not mentioned 



(Chen et al., 2009) 



(Spackman et al., 2009) Reassortment not mentioned 



(Iqbal et al., 2009) 



(Boltz et al., 2010) 



(Smith et al, 2009b) 



(Wu et al., 2008) 



Annotation implies reassortment 



Serotype: H9N1; 
Genotype:H3,5,9Nl 



Annotation implies reassortment 

(Smith et al, 2009b) 

(Yamamoto et al., 2011) Suggests reassortment 



(Marozin et al., 2002) 



(Lu et al., 2010) 



Reassortment not mentioned 



M (Gaydos et al., 2006) 1976 Fort Dix flu outbreak strain 



Seq. annotated as reassortant 
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Table 3. Reassortment patterns of selected swine and human strains. 





Strain 


Subtype 


PB2 


PB1 


PA 


HA 


NP 


NA 


M 


NS 


Swine 
























A/swine/Italy/1521/98 


H1N2 


a 


a 


a 


b 


a 


c 


a 


a 




A/swine/Guangdong/ 1 02/2002 


H3N2 


a 


a 


a 


a 


a 


a 


b 


a 




A/swine/Guangxi/wz/2004 


H5N1 


a 


a 


a 


a 


b 


a 


a 


a 




A/swine/Guangdong/ wxl/2004 


H9N2 


a 


a 


b 


a 


a 


a 


c 


a 




A/swine/Shandong/3/2005 


H3N2 


a 


b 


b 


b 


b 


b 


b 


c 




A/swine/Shandong/ 1 123/2008 


H1N1 


a 


a 


a 


a 


a 


a 


a 


b 


Human 
























A/New Jersey/1976 


H1N1 


a 


b 


a 


b 


a 


b 


a 


a 




A/California/04/2009 


H1N1 


a 


a 


a 


a 


a 


b 


b 


a 



reassortment patterns for Clade B (A/New York/3 2/2003, A/New 
York/198/2003 and A/New York/199/2003), Clade C (A/New 
York/52/2004 and A/New York/59/2004) and A/New York/1 1/ 
2003 from a comprehensive phylogenetic study of 156 complete 
genomes of H3N2 influenza A collected between 1999 and 2004 
from New York (Karasin et al., 2000) are a perfect match with 
the patterns that were previously inferred by examining their 
phylogenies (data not shown). 

Sample bias is a major confounding factor in molecular 
evolutionary analyses, particularly so in our reassortment search. 
The number of isolates available from the first half of the 20th 
century is very scarce, making it difficult to determine 
evolutionary lineages. This is exacerbated by our fixed 
neighbourhood size of 80, which is too big for sparsely 
sampled lineages. We actually did not have any hits from that era. 

In a preliminary analysis with a smaller data set, the oldest 
influenza A strain in the database, A/Brevig Mission/1/1918, was 
picked up by our algorithm, in spite of the fact that, by definition, 
its ancestry and reassortment history could not be directly 
determined by the available data. This is a result of our 
neighbourhoods consisting of both ancestors and descendants, 
when only ancestors define a given strain's reassortment history. 
This would potentially pose problems in highly reassortment 
driven lineages. For example, A/Goose/Guangdong/1/96 (Gs/Gd/ 
1/96), the precursor of the recent HPAI H5N1 lineage in Asia (Li 
et al., 2004) has passed various combinations of its gene 
segments to a few generations of multiple reassortants, which 
did adversely affect our grouping of its own segments by 
ancestry. Direct descendants could also negatively affect the 
output when the reassorted genes get fixed in the population. 
Conversely, such direct descendants of reassortant strains may be 
wrongly selected as reassortants themselves. 

The ideal solution for this problem is to have only ancestors in 
the neighbourhoods. However, it is not possible to distinguish 
ancestors from descendants from our distance matrices alone. It 
would require the construction of all the phylogenies with 
additional assumptions about the relative rate of evolution on 
each branch. 

Minor topological and distance inconsistencies may occur 
across segments in phylogenies even without a reassortment 
event, due to stochastic errors and limitations in distance 
estimation methods. We need to allow for such minor 
inconsistencies so that our algorithm does not wrongly pick up 
strains that are in fact not reassortants. To this end, we must avoid 
too small a neighbourhood size, thereby allowing movements 
upto a certain degree to occur without being considered as results 
of reassortment. Too large a neighbourhood size would, on the 



other hand, not detect small movements that are actually 
reassortment driven and may give distorted results when the 
immediate surroundings are sparsely sampled. After much 
deliberation, we decided to use a neighbourhood size of 
approximately 5% of the data set, which seemed to work 
reasonably well. Perhaps, a neighbourhood size that varies across 
lineages by sampling density would be a potential improvement 
to our algorithm. 

The property of common ancestry should be transitive over all 
segments in order to group the segments by ancestry without 
confusion, (ie. if and j have common ancestry, and j and k have 
common ancestry, then and k should also have common 
ancestry). Nevertheless, many of our results do not satisfy this 
criteria, which is no wonder given the fact that we use a common 
cutoff value for all segment combinations and all lineages. 
Hence, we have had to assume transitivity in some cases before 
assigning the ancestry of each segment, (ie. we assume i and k 
have common ancestry even if the common neighbourhood size 
falls below the cutoff, provided there exists a segment j that has 
common ancestry with both of them). 

We have tried to reconstruct the phylogenies for our data using 
MrBayes (Huelsenbeck and Ronquist, 2001) as described in the 
GiRaF paper (Nagarajan and Kingsford, 2011) and found the 
computation time till convergence with sufficient mixing to be at 
least in the order of months per segment on a single processor 
machine (data not shown). It seems inevitable that we would have 
to settle for phylogeny independent methods at current processor 
speeds, when doing comprehensive analyses of influenza 
genomic data. One such earlier method (Rabadan et al., 2008) 
seemed to perform well in detecting reassortants within lineages, 
but no comprehensive study has been undertaken to date using 
this method. 

In this paper, we demonstrate our algorithm using a 
comprehensive complete genome data set, and strive to find the 
reassortants within that data set while using the same data for 
reference. The same algorithm may be used to check any given 
new influenza A strain with a complete genome sequence for 
reassortment. If this algorithm is to be used for that purpose, it is 
imperative that the reference data set is always maintained up to 
date. We believe this method could be efficiently utilized for 
rapidly testing high throughput sequence data if the need 
arises. 
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